{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 19 - k-nearest neighbors\n", "\n", "The *k-nearest neighbors* algorithm makes a prediction from the k closest training data points. For example, a 3-nearest neighbor algorithm will find the 3 closest data points (using the Euclidean distance) in the training data and use them to make a prediction.\n", "\n", "If we are classifying (trying to predict a qualitative value), the prediction is the class that appears most often among the k neighbors.\n", "\n", "If we are performing regression (trying to predict a quantitative value), the prediction is the mean of the y values of the k neighbors.\n", "\n", "## Classifier\n", "\n", "We will return to the city services survey data from Lab 12 (Decision tree classifiers). Recall that this data was collected by the city of [Somerville, MA](https://en.wikipedia.org/wiki/Somerville,_Massachusetts), which asked residents about their happiness and their ratings of city services.\n", "\n", "The link to download the data is [https://archive.ics.uci.edu/ml/machine-learning-databases/00479/SomervilleHappinessSurvey2015.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/00479/SomervilleHappinessSurvey2015.csv)\n", "\n", "The data columns are:\n", "\n", "- D = decision attribute (D) with values 0 (unhappy) and 1 (happy) \n", "- X1 = the availability of information about the city services \n", "- X2 = the cost of housing \n", "- X3 = the overall quality of public schools \n", "- X4 = your trust in the local police \n", "- X5 = the maintenance of streets and sidewalks \n", "- X6 = the availability of social community events \n", "\n", "Attributes X1 to X6 have values 1 to 5." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", " \n", "from sklearn.preprocessing import MinMaxScaler\n", " \n", "from sklearn.model_selection import train_test_split\n", "\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.neighbors import KNeighborsRegressor\n", "\n", "from sklearn.metrics import confusion_matrix\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As in Lab 12, we will read the data into the dataframe `city`, giving the columns more descriptive names in the process." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_column_names = [\"happy\",\"city_info\",\"housing_cost\", \"school_quality\", \\\n", " \"trust_police\", \"streets_sidewalks\", \"community_events\"]\n", "city = pd.read_csv(\"../data/SomervilleHappinessSurvey2015.csv\", \\\n", " encoding = \"utf-16le\",names = new_column_names, \\\n", " header = 0)\n", "city.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a variable `X` to contain all columns except `happy`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "X = city.iloc[:,1:7]\n", "
\n", "\n", "Define a variable y to be the `happy` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "y = city[\"happy\"]\n", "
\n", "\n", "Split your X and y data into training and testing data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)\n", "
\n", "\n", "The following code creates a 3-nearest neighbor classifier (k = 3), fits the training data to it, and makes predictions for the test data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "k3nn = KNeighborsClassifier(n_neighbors = 3)\n", "k3nn.fit(X_train, y_train)\n", "y_pred = k3nn.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute a confusion matrix for the true values and predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "confusion_matrix(y_test, y_pred, labels = [1,0])\n", "
\n", "\n", "Compute the sensitivity, specificity, precision, and accuracy from the confusion matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "tp, fn, fp, tn = confusion_matrix(y_test, y_pred, labels = [1,0]).ravel()\n", "\n", "sensitivity = tp/(tp + fn)\n", "specificity = tn/(tn + fp)\n", "precision = tp/(tp + fp)\n", "accuracy = (tp + tn)/(tp + tn + fp + fn)\n", "\n", "print(\"Sensitivity:\",sensitivity)\n", "print(\"Specificity:\",specificity)\n", "print(\"Precision:\", precision)\n", "print(\"Accuracy:\",accuracy)\n", "
\n", "\n", "How does changing k, the number of neighbors used to make the prediction, affect the performance of this classifier?\n", "\n", "The results from the decision tree in Lab 12 were: \n", "\n", "Sensitivity: 0.5584415584415584\n", "\n", "Specificity: 0.8181818181818182\n", "\n", "Precision: 0.7818181818181819\n", "\n", "Accuracy: 0.6783216783216783\n", "\n", "How does the k-nearest neighbor classifier compare to the decision tree classifier?\n", "\n", "## Regressor\n", "\n", "To test k-nearest neighbors for regression, we will use the insurance data from Labs 7, 8, and 13. Recall that we are trying to predict the insurance cost, a quantitative value. \n", "\n", "If you don't have the dataset, download it from GitHub: [https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv](https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv)\n", "\n", "In this data, each row represents an insurance policy and the 7 columns contain the following information about it:\n", "- age: age of policy holder\n", "- sex: sex of policy holder\n", "- bmi: body mass index (bmi) of policy holder. bmi is a (sometimes unreliable) measurement of body fat in adults\n", "- children: number of children (dependents) on the policy\n", "- smoker: whether the policy holder is a smoker\n", "- region: region of the country the policy holder lives in\n", "- charges: price for insurance policy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in the insurance data, replacing the qualitative columns with dummy variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create an X variable with the independent variable columns (everything except the charges column)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a y variable with the `charges` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split your X and y data into training and testing data. Name the results `iX_train`, `iX_test`, `iy_train`, and `iy_test`, since the code in the rest of this lab uses those names." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code creates a 3-nearest neighbor regressor (k = 3), fits the training data to it, and makes predictions for the test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ik3nn = KNeighborsRegressor(n_neighbors = 3)\n", "ik3nn.fit(iX_train, iy_train)\n", "iy_pred = ik3nn.predict(iX_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the mean squared error for your predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scaling data (aka normalization)\n", "\n", "When the columns have different scales, the column with the largest range of values will dominate the distance calculation. We can get better results by scaling all of our columns to be between 0 and 1. 
The scaling formula is:\n", "\n", "$$x_{\\text{scaled}} = \\frac{x - x_{\\min}}{x_{\\max} - x_{\\min}}$$\n", "\n", "We can use the built-in `MinMaxScaler` class in scikit-learn to do the scaling:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "scaler = MinMaxScaler(feature_range=(0, 1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iX_train_scaled = scaler.fit_transform(iX_train)\n", "iX_train_scaled" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scale your X test data with `scaler.transform`, which reuses the minimum and maximum learned from the training data, and store the result in `iX_test_scaled`. We do not need to scale the y data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a 3-nearest neighbor regressor with the scaled training data and use it to make predictions for the scaled test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the new mean squared error. Does scaling improve the 3-nearest neighbor regressor?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To figure out which value of k to use, we can write a loop to try all values of k between 1 and 20, and compute the mean squared error for each one. 
The pseudo-code to do this is:\n", "\n", "\n", "create an empty list\n", "loop k from 1 to 20:\n", " create a k-nearest neighbor regressor\n", " fit the training data to the k-nearest neighbor regressor\n", " make predictions for the test data\n", " compute the mean squared error for the predictions\n", " store the mean squared error in the list\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mses = []\n", "for k in range(1,21):\n", " iknn_scaled = KNeighborsRegressor(n_neighbors = k)\n", " iknn_scaled.fit(iX_train_scaled, iy_train)\n", " iy_pred_scaled = iknn_scaled.predict(iX_test_scaled)\n", " mse = ((iy_pred_scaled - iy_test)**2).mean()\n", " mses.append(mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", "\n", "mses = []\n", "for k in range(1,21):\n", " iknn_scaled = KNeighborsRegressor(n_neighbors = k)\n", " iknn_scaled.fit(iX_train_scaled, iy_train)\n", " iy_pred_scaled = iknn_scaled.predict(iX_test_scaled)\n", " mse = ((iy_pred_scaled - iy_test)**2).mean()\n", " mses.append(mse)\n", "\n", "
\n", "\n", "Plot the list of mean squared errors against k. The k with the lowest mean squared error is the best choice for this test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as with linear regression, we can check whether there is a pattern to which values are predicted well and which are not. Make a scatter plot with the true y test values on the x-axis and the residuals (predicted value minus true value) on the y-axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }